Step 1: Define your problem.
  - How to Define Your Machine Learning Problem
Step 2: Prepare your data.
  - How to Prepare Data For Machine Learning
  - How to Identify Outliers in your Data
  - Improve Model Accuracy with Data Pre-Processing
  - Discover Feature Engineering
  - An Introduction to Feature Selection
  - Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset
  - Data Leakage in Machine Learning
Step 3: Spot-check algorithms.
  - How to Evaluate Machine Learning Algorithms
  - Why you should be Spot-Checking Algorithms on your Machine Learning Problems
  - How To Choose The Right Test Options When Evaluating Machine Learning Algorithms
  - A Data-Driven Approach to Choosing Machine Learning Algorithms
Step 4: Improve results.
  - How to Improve Machine Learning Results
  - Machine Learning Performance Improvement Cheat Sheet
  - How To Improve Deep Learning Performance
Step 5: Present results.
  - How to Use Machine Learning Results
  - How To Deploy Your Predictive Model To Production
Linear regression is a modeling approach for describing the relationship between a scalar dependent variable Y and one or more explanatory variables (or independent variables) denoted X.
In simple linear regression there is a single explanatory variable; in multiple linear regression there are several.
In linear regression, the relationships are modeled using linear predictor functions whose unknown parameters are estimated from the data; such models are called linear models.
Example of linear regression
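As a quick illustration before the fuller scikit-learn examples below, here is a minimal sketch (with hypothetical data) that fits a simple linear regression y = a + b*x by ordinary least squares using NumPy:
import numpy as np
# Hypothetical data: y depends roughly linearly on x, plus noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=x.size)
# Fit y = a + b*x by ordinary least squares; polyfit returns the
# coefficients highest degree first, i.e. [b, a]
b, a = np.polyfit(x, y, deg=1)
print('intercept a = %.2f, slope b = %.2f' % (a, b))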
A regression equation is a polynomial regression equation if the power of the independent variable is greater than 1. The equation below is an example of a polynomial equation:
y = a + b*x^2
In this regression technique, the best-fit line is not a straight line but a curve that fits the data points.
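A minimal sketch, assuming scikit-learn and hypothetical data: PolynomialFeatures expands x into polynomial terms, and an ordinary LinearRegression is then fit on the expanded features.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
# Hypothetical data following y = a + b*x^2 plus noise
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 40).reshape(-1, 1)
y = 1.0 + 0.5 * x.ravel() ** 2 + rng.normal(scale=0.2, size=40)
# Expand x into [x, x^2] and fit a linear model on the expanded features
poly = PolynomialFeatures(degree=2, include_bias=False)
x_poly = poly.fit_transform(x)
model = LinearRegression().fit(x_poly, y)
print('intercept:', model.intercept_, 'coefficients:', model.coef_)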
This form of regression, known as stepwise regression, is used when we deal with many independent variables. In this technique, the selection of independent variables is carried out by an automatic procedure, with no human intervention; a sketch follows below.
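scikit-learn does not implement classical p-value-based stepwise regression, but its SequentialFeatureSelector performs an analogous automatic forward selection. A minimal sketch, using the diabetes dataset purely as stand-in input:
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
X, y = load_diabetes(return_X_y=True)
# Greedily add features one at a time, keeping the subset that most
# improves cross-validated performance (forward stepwise selection)
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction='forward')
selector.fit(X, y)
print('selected feature indices:', selector.get_support(indices=True))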
Logistic regression is used to estimate the probability of an event occurring (success) versus not occurring (failure). We should use logistic regression when the dependent variable is binary (0/1, True/False, Yes/No) in nature. Here the value of Y ranges from 0 to 1, and the model can be represented by the following equations:
odds = p / (1-p) = probability of event occurrence / probability of event not occurring
ln(odds) = ln(p/(1-p))
logit(p) = ln(p/(1-p)) = b0 + b1X1 + b2X2 + b3X3 + ... + bkXk
Above, p is the probability of presence of the characteristic of interest. A question you should ask here is: why have we used the log in the equation?
Since the dependent variable follows a binomial distribution, we need to choose the link function best suited to that distribution, and that is the logit function. In the equation above, the parameters are chosen to maximize the likelihood of observing the sample values, rather than to minimize the sum of squared errors (as in ordinary regression).
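A minimal sketch, assuming scikit-learn and using its built-in breast cancer dataset as a stand-in binary problem:
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Binary classification data: the target is 0 or 1
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
# Fit the logistic model; its coefficients are the b0..bk of the logit equation
clf = LogisticRegression(max_iter=5000)
clf.fit(X_train, y_train)
# predict_proba returns p, the probability of the event for each sample
print('P(event) for first test sample: %.3f' % clf.predict_proba(X_test)[0, 1])
print('test accuracy: %.3f' % clf.score(X_test, y_test))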
We are going to use scikit-learn for our examples. Please install the following packages using conda:
conda install pandas scikit-learn matplotlib
or pip:
pip install pandas scikit-learn matplotlib --user
In [4]:
from sklearn import datasets
from sklearn.model_selection import cross_val_predict
from sklearn import linear_model
import matplotlib.pyplot as plt
lr = linear_model.LinearRegression()
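# Load the Boston housing dataset (note: load_boston was deprecated and
# removed in scikit-learn 1.2; this cell assumes an older release)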
boston = datasets.load_boston()
y = boston.target
# cross_val_predict returns an array of the same size as `y` where each entry
# is a prediction obtained by cross validation:
predicted = cross_val_predict(lr, boston.data, y, cv=10)
fig, ax = plt.subplots()
ax.scatter(y, predicted)
ax.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()
R-squared:
R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination (or, for multiple regression, the coefficient of multiple determination).
The definition of R-squared is fairly straightforward: it is the percentage of the response-variable variation that is explained by the linear model. Or:
R-squared = Explained variation / Total variation
R-squared is always between 0 and 100%:
0% indicates that the model explains none of the variability of the response data around its mean.
100% indicates that the model explains all the variability of the response data around its mean.
In general, the higher the R-squared, the better the model fits your data. However, there are important caveats to this guideline; a high R-squared alone does not guarantee that the model is a good one.
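As a sanity check, here is a minimal sketch computing R-squared directly from its definition (explained variation over total variation) and comparing it with scikit-learn's r2_score; the arrays are hypothetical:
import numpy as np
from sklearn.metrics import r2_score
y_true = np.array([3.0, 5.0, 7.5, 9.0, 11.0])
y_pred = np.array([2.8, 5.3, 7.0, 9.4, 10.6])
# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
print('manual R^2:  %.4f' % (1 - ss_res / ss_tot))
print('sklearn R^2: %.4f' % r2_score(y_true, y_pred))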
Standard Error of the Regression (S)
In the regression output for Minitab statistical software, you can find S in the Summary of Model section, right next to R-squared. Both statistics provide an overall measure of how well the model fits the data. S is known both as the standard error of the regression and as the standard error of the estimate.
S represents the average distance that the observed values fall from the regression line. Conveniently, it tells you how wrong the regression model is on average, in the units of the response variable. Smaller values are better because they indicate that the observations are closer to the fitted line.
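A minimal sketch of this statistic, using the standard formula S = sqrt(SS_res / (n - k - 1)) with k predictors (here k = 1, and the data are the same hypothetical arrays as above):
import numpy as np
y_true = np.array([3.0, 5.0, 7.5, 9.0, 11.0])
y_pred = np.array([2.8, 5.3, 7.0, 9.4, 10.6])
k = 1  # number of predictors in the fitted model
# S = sqrt(residual sum of squares / residual degrees of freedom)
ss_res = np.sum((y_true - y_pred) ** 2)
s = np.sqrt(ss_res / (len(y_true) - k - 1))
print('standard error of the regression: %.4f' % s)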
In [7]:
# Code source: Jaques Grobler
# License: BSD 3 clause
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
# Load the diabetes dataset
diabetes = datasets.load_diabetes()
# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)
# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f"
% mean_squared_error(diabetes_y_test, diabetes_y_pred))
# The coefficient of determination (R^2): 1 is perfect prediction
print('R^2 score: %.2f' % r2_score(diabetes_y_test, diabetes_y_pred))
# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color='black')
plt.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
Mean squared error (MSE)
To compute the MSE, we first take the square of the difference between the actual and predicted values of every record, and then take the average of these squared errors. If the predicted value of the i-th record is Pi and the actual value is Ai, then for n records the MSE is:
MSE = (1/n) * sum over i of (Pi - Ai)^2
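A minimal sketch comparing this formula with scikit-learn's mean_squared_error, on hypothetical values:
import numpy as np
from sklearn.metrics import mean_squared_error
A = np.array([3.0, 5.0, 7.5, 9.0, 11.0])  # actual values
P = np.array([2.8, 5.3, 7.0, 9.4, 10.6])  # predicted values
# MSE = (1/n) * sum of (Pi - Ai)^2
print('manual MSE:  %.4f' % np.mean((P - A) ** 2))
print('sklearn MSE: %.4f' % mean_squared_error(A, P))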